Testing the Approximate Validity of Statistical Hypotheses

By J. L. Hodges, Jr. and E. L. Lehmann* 
University of California, Berkeley 
[Received March, 1954] 

SUMMARY 

The distinction between statistical significance and material significance in hypothesis
testing is discussed. Modifications of the customary tests, in order to test for the
absence of material significance, are derived for several parametric problems, for
the chi-square test of goodness of fit, and for Student's hypothesis. The latter
permits one to test the hypothesis that the means of two normal populations of
equal variance do not differ by more than a stated amount.

1. Introduction 

When testing statistical hypotheses, we usually do not wish to take the action of rejection 
unless the hypothesis being tested is false to an extent sufficient to matter. For example, we 
may formulate the hypothesis that a population is normally distributed, but we realize that no 
natural population is ever exactly normal. We would want to reject normality only if the departure
of the actual distribution from the normal form were great enough to be material for our
investigation. Again, when we formulate the hypothesis that the sex ratio is the same in two
populations, we do not really believe that it could be exactly the same, and would only wish to
reject equality if the two ratios are sufficiently different. Further examples of the phenomenon
will occur to the reader.

In practice, this imprecision in the formulation does not usually cause much trouble, since the 
tests employed are sufficiently lacking in power that we do not run serious risk of rejecting the 
hypothesis unless it is false to a considerable extent. But whenever the available data are extensive, 
the tests may become embarrassingly powerful, leading to a paradox enunciated by Berkson 
(1938): 

"I believe that an observant statistician who has had any considerable experience with applying 
the chi-square test repeatedly will agree with my statement that, as a matter of observation, when 
the numbers in the data are quite large, the P's tend to come out small. Having observed this, 
and on reflection, I make the following dogmatic statement, referring for illustration to the normal 
curve: 'If the normal curve is fitted to a body of data representing any real observations whatever 
of quantities in the physical world, then if the number of observations is extremely large — for 
instance, on the order of 200,000 — the chi-square P will be small beyond any usual limit of signi- 
ficance.' 

"This dogmatic statement is made on the basis of an extrapolation of the observation referred 
to and can also be defended as a prediction from a priori considerations. For we may assume 
that it is practically certain that any series of real observations does not actually follow a normal 
curve with absolute exactitude in all respects, and no matter how small the discrepancy between 
the normal curve and the true curve of observations, the chi-square P will be small if the sample 
has a sufficiently large number of observations in it. 

"If this be so, then we have something here that is apt to trouble the conscience of a reflective 
statistician using the chi-square test. For I suppose it would be agreed by statisticians that a 
large sample is always better than a small sample. If, then, we know in advance the P that will 
result from an application of a chi-square test to a large sample there would seem to be no use 
in doing it on a smaller one. But since the result of the former test is known, it is no test at all!" 

It seems to us that this difficulty can be avoided by making a clear distinction, in the formulation 
of the problem, between "statistical significance" and what might be called "material significance". 

* This paper was prepared with the partial support of the Office of Naval Research, U.S.A. 
In the space of the parameters we may distinguish a set $H_0$ of values, representing the idealized
hypothesis as it is customarily formulated. About the set $H_0$ we may then distinguish a larger set
$H_1$ of values, representing situations close enough to $H_0$ that the difference is not materially
significant in the problem at hand. If we knew that some distribution $\theta$ in $H_1$ were the true one,
we should still wish to accept the hypothesis idealized by $H_0$. The size of the test should be the
maximum value of the power function $\beta(\theta)$ in $H_1$, not $H_0$. Outside of $H_1$, however, we want to
reject, so that consistency outside of $H_1$ is no longer undesirable. We reject as soon as there is
statistically significant evidence that the departure from $H_0$ is materially significant.

It might be objected that there is nothing novel in the point of view just presented. We may
just forget about $H_0$ and let $H_1$ play the role of the hypothesis in the classical formulation. There
is however often a practical advantage in keeping $H_0$, as it is mathematically simple and corresponds
to the idea underlying the situation to be tested. The boundaries of $H_1$ will be less precise in the
experimenter's mind. It will usually be best to introduce into the space of parameters a measure,
say $\Delta(\theta)$, of the "distance" of $\theta$ from $H_0$ on a scale reflecting at least roughly the materiality of
departures from $H_0$, and then define $H_1$ as the set of those $\theta$ for which $\Delta(\theta)$ does not exceed a
specified value $\Delta_0$. The choice of $\Delta_0$ will present problems similar to those encountered in choosing
the alternative at which specified power is to be obtained.

If such a formulation is adopted, it is necessary to modify the customary tests, so as to adjust 
them to the new situation. This will be done briefly in section 2, for a few simple parametric 
situations which require only trivial modifications of the standard tests. In section 3 a modification 
of Student's problem is treated, and in section 4 a discussion from the present point of view is 
given of the $\chi^2$-test for goodness of fit.

2. Some Parametric Problems 

The simplest situation of the kind described in the introduction arises when a single parameter
is involved, as for example in the case of a sample from a binomial or Poisson variable. We
then have a (vector-valued) random variable $X$, with generalized probability density $p_\theta$. For
testing $H : \theta = \theta_0$ we have available a test statistic $T$ such that

$$P_\theta(a < T < b) = G_{a,b}(\theta) \qquad (2.1)$$

is a unimodal function of $\theta$. For testing $H : \theta = \theta_0$ we would adjust $a$ and $b$ such that $G_{a,b}(\theta)$
takes on its maximum at $\theta = \theta_0$ and that the value of this maximum is $1 - \alpha$. If now instead
we test $H^* : \theta_1 \le \theta \le \theta_2$ we adjust $a$, $b$ so that $G_{a,b}(\theta_1) = G_{a,b}(\theta_2) = 1 - \alpha$. It then follows
that the power function $\beta(\theta) = 1 - G_{a,b}(\theta)$ satisfies $\beta(\theta) \le \alpha$ for $\theta_1 \le \theta \le \theta_2$ and $\beta(\theta) > \alpha$
for $\theta < \theta_1$ and $\theta_2 < \theta$.
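As a rough illustration of the adjustment just described, the following Python sketch treats a single binomial count $X$ with $T = X$. Because $X$ is discrete, the two equations $G_{a,b}(\theta_1) = G_{a,b}(\theta_2) = 1 - \alpha$ cannot in general be met exactly without randomization, so the sketch settles for the shortest integer acceptance interval whose rejection probability is at most $\alpha$ at both end points. The function names and the numbers $n = 100$, $\theta_1 = 0.4$, $\theta_2 = 0.6$ are illustrative assumptions and do not come from the paper.

```python
from math import comb


def binom_pmf(n, theta):
    """Probabilities P(X = k), k = 0..n, for X ~ Binomial(n, theta)."""
    return [comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(n + 1)]


def accept_prob(a, b, pmf):
    """G_{a,b}(theta) = P(a <= X <= b) under the given pmf."""
    return sum(pmf[a:b + 1])


def interval_test(n, theta1, theta2, alpha):
    """Shortest acceptance interval [a, b] with size <= alpha at both end points.

    Non-randomized, hence conservative: this is an assumption of the sketch.
    """
    pmf1, pmf2 = binom_pmf(n, theta1), binom_pmf(n, theta2)
    best = None
    for a in range(n + 1):
        for b in range(a, n + 1):
            ok1 = 1 - accept_prob(a, b, pmf1) <= alpha
            ok2 = 1 - accept_prob(a, b, pmf2) <= alpha
            if ok1 and ok2:
                if best is None or b - a < best[1] - best[0]:
                    best = (a, b)
                break  # a larger b only lengthens the interval for this a
    return best


def power(a, b, n, theta):
    """beta(theta) = 1 - G_{a,b}(theta)."""
    return 1 - accept_prob(a, b, binom_pmf(n, theta))


if __name__ == "__main__":
    n, theta1, theta2, alpha = 100, 0.4, 0.6, 0.05
    a, b = interval_test(n, theta1, theta2, alpha)
    print("accept H* when", a, "<= X <=", b)
    for theta in (0.3, 0.4, 0.5, 0.6, 0.7):
        print(theta, round(power(a, b, n, theta), 4))
```

The printed power values should stay at or below $\alpha$ between the two end points and rise outside them, which is the behaviour claimed for $\beta(\theta)$ in the text.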

The following are some examples of this approach : 

(i) $X$ has a binomial distribution with probability $\theta$ of success. We can then take $T = X$.

(ii) $X_1, \dots, X_n$ is a sample from a Poisson distribution with $E(X_i) = \theta$. We take $T = \sum X_i$.

(iii) $X_1, \dots, X_n$ is a sample from a normal distribution, mean $\xi$, variance $\sigma^2$. We wish to
test $H^* : \sigma_1 \le \sigma \le \sigma_2$, and take $T = \sum (X_i - \bar X)^2$.

(iv) $X_1, \dots, X_n$ is a sample from a normal distribution, mean $\xi$, variance $\sigma^2$, and we wish to
test $H^* : \delta_0 \le \xi/\sigma \le \delta_1$. We may then take

$$T = \frac{\bar X}{\sqrt{\sum (X_i - \bar X)^2}}.$$

This is a special case of the general linear hypothesis which we consider later in this section.
Certain two-sample problems immediately reduce to situations of the kind just considered.
In particular, let $X_1, \dots, X_m$; $Y_1, \dots, Y_n$ be samples from two normal populations with
means $\xi$, $\eta$ and variances $\sigma^2$ and $\tau^2$ respectively. Then we can test

(v) the hypothesis $H^* : a \le \sigma^2/\tau^2 \le b$ by putting $T = \sum (X_i - \bar X)^2 / \sum (Y_j - \bar Y)^2$.
If we assume in addition that $\sigma = \tau$ we can base a test of the hypothesis

(vi) $H^* : \delta_0 \le (\eta - \xi)/\sigma \le \delta_1$ on the statistic

$$T = \frac{\bar Y - \bar X}{\sqrt{\sum (X_i - \bar X)^2 + \sum (Y_j - \bar Y)^2}}.$$
The case of two Poisson or binomial populations also can be reduced to a one-parameter
problem of the kind already considered.

(vii) Suppose that $X$, $Y$ are two independent Poisson variables with means $\lambda$ and $\mu$ respectively,
and that we wish to test $H^* : a \le \lambda/\mu \le b$. As is well known and easily checked, the conditional
distribution of $Y$ given $X + Y = m$ is the binomial distribution of $m$ trials and success probability
$p = \mu/(\lambda + \mu) = 1/(1 + \lambda/\mu)$. Thus we obtain a test of $H^*$ by making on each line $X + Y = m$
a conditional binomial test of the hypothesis $1/(1 + b) \le p \le 1/(1 + a)$.
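The conditional-binomial fact used in (vii) can be checked numerically. The sketch below, whose function names and parameter values are chosen here purely for illustration, compares $P(Y = y \mid X + Y = m)$ computed from the two Poisson laws with the binomial probabilities for success probability $p = \mu/(\lambda + \mu)$, and maps bounds on $\lambda/\mu$ to the corresponding bounds on $p$.

```python
from math import comb, exp, factorial


def poisson_pmf(k, mean):
    """P(N = k) for N ~ Poisson(mean)."""
    return exp(-mean) * mean**k / factorial(k)


def conditional_pmf(y, m, lam, mu):
    """P(Y = y | X + Y = m) for independent X ~ Poisson(lam), Y ~ Poisson(mu)."""
    joint = poisson_pmf(m - y, lam) * poisson_pmf(y, mu)
    total = sum(poisson_pmf(m - j, lam) * poisson_pmf(j, mu) for j in range(m + 1))
    return joint / total


def binomial_pmf(y, m, p):
    """P(Y = y) for Y ~ Binomial(m, p)."""
    return comb(m, y) * p**y * (1 - p)**(m - y)


def p_bounds(a, b):
    """Image of a <= lam/mu <= b under p = 1 / (1 + lam/mu)."""
    return 1 / (1 + b), 1 / (1 + a)


if __name__ == "__main__":
    lam, mu, m = 3.0, 1.5, 12
    p = mu / (lam + mu)
    gap = max(abs(conditional_pmf(y, m, lam, mu) - binomial_pmf(y, m, p))
              for y in range(m + 1))
    print("largest discrepancy:", gap)
    print("p-interval for 1 <= lam/mu <= 4:", p_bounds(1.0, 4.0))
```

Once the bounds on $p$ are in hand, the conditional test on each line $X + Y = m$ is an interval test for a binomial parameter of the kind treated in example (i).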

In a completely analogous manner we can test $H^* : a \le (p_1/q_1)/(p_2/q_2) \le b$ where $X$, $Y$ are
independent binomial variables with probabilities $p_1$, $p_2$ of success respectively, and where
$q_i = 1 - p_i$, since again the conditional distribution of $Y$ given $X + Y = m$ depends on $p_1$, $p_2$
only through $\theta = (p_1/q_1)/(p_2/q_2)$.

As a last example consider the general univariate linear hypothesis, which we shall take in
the canonical form, according to which $Y_1, \dots, Y_r$; $Y_{r+1}, \dots, Y_s$; $Y_{s+1}, \dots, Y_n$ are
independently distributed with common (unknown) variance $\sigma^2$ and means

$$E(Y_i) = \eta_i, \quad i = 1, \dots, s; \qquad E(Y_i) = 0, \quad i = s+1, \dots, n.$$

The usual hypothesis $H : \eta_1 = \dots = \eta_r = 0$ is tested by rejecting when

$$W = \frac{\sum_{i=1}^{r} Y_i^2 / r}{\sum_{i=s+1}^{n} Y_i^2 / (n - s)} \qquad (2.2)$$

is too large. Here $W$ has a noncentral $F$-distribution with noncentrality parameter
$\lambda = \sum_{i=1}^{r} \eta_i^2/\sigma^2$,
and in fact the probability $P(W > C)$ is a strictly increasing function of $\lambda$. From this it is obvious
how to test $H^* : \sum \eta_i^2/\sigma^2 \le \lambda_0$. We again reject when $W$ exceeds a constant $C$, but instead of
determining $C$ from the central $F$-distribution, we now determine it from the equation

$$P(W > C \mid \lambda = \lambda_0) = \alpha.$$
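The cut-off $C$ in the last equation is a quantile of a noncentral $F$-distribution. In place of tables, the sketch below estimates it by simulation, which is a substitute technique and not the authors' method. It assumes the $Y_i$ are normal, as the noncentral $F$ statement implies, and the function names, the seed and the sample sizes are hypothetical choices made for illustration.

```python
import random


def simulate_w(r, s, n, lam, rng, sigma=1.0):
    """One draw of W in (2.2) under the canonical form with sum(eta_i^2)/sigma^2 = lam."""
    # Put the whole noncentrality on the first coordinate; W depends on eta only via lam.
    eta = [sigma * lam**0.5] + [0.0] * (r - 1)
    num = sum(rng.gauss(e, sigma)**2 for e in eta) / r
    den = sum(rng.gauss(0.0, sigma)**2 for _ in range(n - s)) / (n - s)
    return num / den


def cutoff(r, s, n, lam0, alpha, draws=20000, seed=0):
    """Monte Carlo estimate of C solving P(W > C | lam = lam0) = alpha."""
    rng = random.Random(seed)
    ws = sorted(simulate_w(r, s, n, lam0, rng) for _ in range(draws))
    return ws[int((1 - alpha) * draws)]


if __name__ == "__main__":
    r, s, n, alpha = 3, 5, 25, 0.05
    print("central cut-off (lam0 = 0):", round(cutoff(r, s, n, 0.0, alpha), 2))
    print("cut-off for lam0 = 4:     ", round(cutoff(r, s, n, 4.0, alpha), 2))
```

The second cut-off should come out larger than the first, reflecting the fact that $P(W > C)$ increases with $\lambda$: allowing departures up to $\lambda_0$ requires stronger evidence before rejecting.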

It is interesting to note that this is the uniformly most powerful test for testing $H^*$ among all
those based on $W$ (or equivalently that it is the most powerful invariant test for testing $H^*$; see,
for example, Lehmann (1950)). To obtain the test having this property we must consider the
probability ratio test based on $W$ for testing $\sum \eta_i^2/\sigma^2 = \lambda_0$ against the alternative
$\sum \eta_i^2/\sigma^2 = \lambda_1$, which rejects when

$$\frac{p_{\lambda_1}(w)}{p_{\lambda_0}(w)} \qquad (2.3)$$

is too large. (Here $p_\lambda$ denotes the density of $W$ for parameter value $\lambda$.) As we shall show below,
(2.3) is an increasing function of $w$ and this test is therefore equivalent to rejecting when $w > C$.
From this it is then seen that the probability $P\{p_{\lambda_1}(W)/p_{\lambda_0}(W) \ge k\}$ is an increasing function
of the true parameter value $\lambda$, and the result follows.

It only remains to show that the ratio (2.3) is an increasing function of w. (This fact has 
been known for some time. The proof which follows is essentially the same as that given by 
Paul L. Meyer in An Application of the Invariance Principle to the Student Hypothesis, Technical 
Report No. 24, Department of Statistics, Stanford University, California.) We have 

where $0 < w$ and the $c_k$ are positive constants. Hence, if we put $z = w/2(1 + w)$, $a_k = c_k \lambda_0^k / k!$
and $b_k = c_k \lambda_1^k / k!$, we have

$$\frac{p_{\lambda_1}(w)}{p_{\lambda_0}(w)} = \frac{\sum b_k z^k}{\sum a_k z^k} = f(z), \text{ say}.$$

It is easy to show that the derivative of $f(z)$ is given by

$$\Big(\sum a_k z^k\Big)^2 f'(z) = \sum_{k<n} (n - k)(a_k b_n - a_n b_k)\, z^{k+n-1},$$

which is positive since $b_k/a_k < b_n/a_n$ for $k < n$. It follows that $p_{\lambda_1}(w)/p_{\lambda_0}(w)$ is an increasing
function of $z$ and hence of $w$.

3. Student's Problem

We next consider the modification of the classical Student testing problem in which absolute
units are used, rather than $\sigma$-units as was done in example (iv) above. Let $X$ have a normal
distribution with $E(X) = \xi$, $V(X) = \sigma^2$, and let $S$, independent of $X$, be such that $S/\sigma$ has the chi
distribution with $m$ degrees of freedom. The problem of testing the hypothesis that $\xi$ equals a
specified value $\xi_0$, with $\sigma$ playing the role of a nuisance parameter, is often formulated. For
example, it arises when testing whether two normal populations, of the same but unknown variance,
have the same mean. In line with the arguments advanced above, we feel that in many
applications it would be more realistic to formulate as our hypothesis the assertion that $\xi$ does not
differ from $\xi_0$ by more than a specified quantity chosen in the light of the material problem. For
instance, instead of testing that two normal populations have the same mean, we would test that
their means do not differ by more than an amount specified to represent the smallest difference
of practical interest.

One possible test of the modified hypothesis that $\xi_1 \le \xi \le \xi_2$, which uses only existing tables,
consists in performing separate one-tailed $t$-tests of the two one-sided hypotheses:

$H : \xi \le \xi_2$ against alternative $\xi > \xi_2$

and

$H : \xi \ge \xi_1$ against alternative $\xi < \xi_1$.

We then reject if either of these separate tests rejects. The size of this composite $t$-test is the sum
of their separate sizes, as is seen by letting $\sigma$ tend to $\infty$. (By the size of a test we mean as usual
the upper bound of its probability of first kind error.) Thus, we would obtain a test of size $\alpha$
if each of the one-sided $t$-tests was carried out at level $\alpha/2$. The main objection to the composite
$t$-test solution of our problem lies in the fact that, when $\sigma$ is small, the test has the power only
of the $t$-test with level $\alpha/2$, although we are paying for a test of size $\alpha$. The obvious remedy is to
seek an unbiased test. Such a test will of necessity have power identically $\alpha$ when $\xi = \xi_1$ or
$\xi = \xi_2$; that is, it will be "similar on the boundary". For simplicity we shall assume throughout
that $\alpha \le \tfrac{1}{2}$ and make a scale change to insure $\xi_1 = -1$, $\xi_2 = 1$.

In the present paragraph we consider the right boundary, that is, we assume $\xi = 1$. We
may specify the location of the sample point $(x, s)$ by the polar coordinates $(r, \theta)$ defined by
$r^2 = (x - 1)^2 + s^2$, $\sin\theta = s/r$. It is well known that the corresponding random variables $R$,
$\Theta$ are independent and that $P[\Theta \le \theta] = \tfrac{1}{2} I_{\sin^2\theta}(\tfrac{1}{2}m, \tfrac{1}{2})$ for $0 \le \theta \le \pi/2$, where $I$ is the
Incomplete Beta function. The distribution of $\Theta$ is further symmetric about $\pi/2$. It is known
(Lehmann and Scheffé, 1950) that $(X - 1)^2 + S^2 = R^2$ is a complete sufficient statistic for $\sigma$,
so that the only tests of size $\alpha$ which are similar for $\xi = 1$ are those whose critical regions intersect
each semicircle $R = r$ in a set of points whose conditional probability is $\alpha$.

As the analogous argument goes through for $\xi = -1$, we have a double condition on the
test region. We also impose the intuitively reasonable requirements that the test region be
symmetric about the $s$-axis, and that for given values of $R$ we want to reject for extreme values
of $\theta$. The problem now formulated is to construct a test region satisfying these conditions. We
shall carry through this construction in detail in the simple case $m = 1$, where an explicit solution
can be given.

When $m = 1$, $\Theta$ has the uniform distribution over $(0, \pi)$, so that we must take from each
semicircle an arc of size $\pi\alpha$. The construction is made inductively on increasing values of $r$. From
semicircles of radius $r \le 2 = r_1$ we take arcs of size $\pi\alpha$ adjacent to the positive $x$-axis (see Fig. 1).
This gives the segment $P_0P_1$ as part of the boundary of the rejection region. By the symmetry
requirement, we must also reject at points below the segment $Q_0Q_1$, obtained by reflecting $P_0P_1$
about the $s$-axis. Let $r_2$ denote the distance from $P_0$ to $Q_1$, and next consider the semicircles
with radii $r_1 \le r \le r_2$. From these we have already taken as rejection points the arcs below
$Q_0Q_1$, which are the same size as the arcs below the segment obtained by translating $P_0P_1$
two units to the right. To bring the rejection arcs up to size $\pi\alpha$, we cut off the additional rejection
arcs below $P_1P_2$. The notable fact, obvious from the symmetry of the picture, is that the boundary
portion $P_1P_2$ is a horizontal straight line segment. The construction proceeds inductively,
producing a boundary of the rejection region consisting of straight line segments of length 2, alternately
horizontal and of slope $\tan \pi\alpha$, together of course with its reflection about the $s$-axis.
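The polygonal boundary for $m = 1$ can be written down in closed form and its similarity checked numerically. The sketch below is our reading of the construction, not code from the paper: it assumes the inclined segments make the angle $\pi\alpha$ with the $x$-axis, so that at height $s$ the right-hand boundary lies at $x = 1 + 2\lfloor s/(2\sin\pi\alpha)\rfloor + s/\tan\pi\alpha$, and it measures the fraction of each semicircle about $(\pm 1, 0)$ that falls in the rejection region. The function names and grid size are illustrative assumptions.

```python
from math import cos, floor, pi, sin, tan


def rejects(x, s, alpha):
    """Rejection rule for m = 1 read off the polygonal boundary.

    Starting at P0 = (1, 0) the boundary alternates segments of length 2 that are
    inclined at angle pi*alpha and horizontal, so at height s > 0 it sits at
    x = 1 + 2*floor(s / (2 sin(pi*alpha))) + s / tan(pi*alpha).
    """
    step = 2.0 * sin(pi * alpha)            # height gained on each inclined segment
    bound = 1.0 + 2.0 * floor(s / step) + s / tan(pi * alpha)
    return abs(x) > bound                   # region is symmetric about the s-axis


def arc_fraction(r, alpha, centre=1.0, grid=50000):
    """Fraction of the semicircle of radius r about (centre, 0) that is rejected."""
    hits = 0
    for j in range(grid):
        theta = pi * (j + 0.5) / grid       # midpoint rule on (0, pi)
        if rejects(centre + r * cos(theta), r * sin(theta), alpha):
            hits += 1
    return hits / grid


if __name__ == "__main__":
    alpha = 0.05
    for r in (0.5, 2.0, 3.0, 5.0, 9.0):
        print(r, round(arc_fraction(r, alpha), 4), round(arc_fraction(r, alpha, -1.0), 4))
```

Since $\Theta$ is uniform on $(0, \pi)$ when $m = 1$, each printed fraction is the conditional rejection probability given $R = r$, and it should equal $\alpha$ up to the grid error for every radius and for both centres.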




The test we have obtained is by its construction similar on the boundary $\xi = \pm 1$; to verify
that it is in addition unbiased, we need only observe that for each given value of $s$, we accept for
$x$ in an interval centered at 0. Therefore, the conditional probability of rejection, given $s$, is a
monotonely increasing function of $|\xi|$, whatever $s$ may be. The same must then hold for
the power function as a function of $|\xi|$ for any given value of $\sigma$.

For comparison, we show as the dashed line of the Figure the boundary of the composite
$t$-test of size $\alpha$. The shaded region represents the additional rejection points which the new test
affords without increase in size. It is these points which provide our test with its greater power
for finite $\sigma$.

So far we have been considering only the case $m = 1$, where the problem is simple and we
can give an explicit solution. When we turn to the general case things are not so easy. The
inductive construction may still be employed, alternately translating boundary segments to the
right by two units and then generating new segments by the requirement that a conditional
probability $\alpha$ be taken on each arc. This leads to the relation

$$(r')^2 = 4r\cos\theta + r^2 + 4,$$
$$I(\sin^2\theta') = 2\alpha - I\!\left[\frac{r^2\sin^2\theta}{(r')^2}\right], \qquad (3.1)$$

where $(r, \theta)$ and $(r', \theta')$ are two boundary points, and we write $I(x)$ for $I_x(\tfrac{1}{2}m, \tfrac{1}{2})$. These equations,
together with the initial values $\theta = \arcsin\sqrt{I^{-1}(2\alpha)}$ for $0 < r \le 2$, permit the effective
computation of a boundary curve $C$. It appears from the numerical computations underlying the chart
that $C$ has positive slope everywhere, but we have not found a proof of this. From the assumption
that $C$ has positive slope it would follow (as for $m = 1$) that the test region with boundary $C$
is unbiased, and at least approximate unbiasedness is in any case guaranteed by the numerical
results.
Fig. 2 shows critical values of the 5 per cent. test for various degrees of freedom $m$. Suppose
$X$ is normal, $E(X) = \xi$, $V(X) = \sigma^2$, $S^2/\sigma^2$ has the chi-square distribution of $m$ degrees of freedom,
$X$ and $S$ are independent, and we wish to test $|\xi| \le 1$. If $|X| \le 1$, accept. Otherwise,
let $t$ be $(X - 1)\sqrt{m}/S$ if $X > 1$, or $(-1 - X)\sqrt{m}/S$ if $X < -1$. Read the critical value $t_0$
from Fig. 2 corresponding to the value of $m$ and of $\sqrt{m}/S$. Reject if $t > t_0$.

To illustrate the test, suppose $Y_1, \dots, Y_k$ is a sample from the normal population of
expectation $\eta$ and variance $\tau^2$, while $Z_1, \dots, Z_n$ is a sample from the normal population of expectation $\zeta$
and variance $\tau^2$. We wish to test the hypothesis that the absolute difference $|\eta - \zeta|$ of the
population means does not exceed $\Delta$. If we let $k\bar Y = \sum Y_i$, $n\bar Z = \sum Z_j$, $X = (\bar Y - \bar Z)/\Delta$,
$\xi = (\eta - \zeta)/\Delta$, $\sigma^2 = (k + n)\tau^2/kn\Delta^2$, $S^2 = (k + n)\{\sum (Y_i - \bar Y)^2 + \sum (Z_j - \bar Z)^2\}/kn\Delta^2$, and
$m = k + n - 2$, then the assumptions of the test are satisfied.
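The reduction of the two-sample problem to the pair $(X, S)$ is mechanical, and a small Python sketch may make the bookkeeping concrete. The function names and the toy data are assumptions introduced here for illustration; the critical value $t_0$ must still be read from the authors' chart, which is not reproduced in the code.

```python
from math import sqrt


def reduce_two_samples(y, z, delta):
    """Map two samples to (X, S, m) so that |eta - zeta| <= delta becomes |xi| <= 1."""
    k, n = len(y), len(z)
    ybar, zbar = sum(y) / k, sum(z) / n
    ss = sum((v - ybar)**2 for v in y) + sum((v - zbar)**2 for v in z)
    x = (ybar - zbar) / delta
    s = sqrt((k + n) * ss / (k * n * delta**2))
    return x, s, k + n - 2


def t_statistic(x, s, m):
    """The quantity t to compare with the chart value; None means accept outright."""
    if abs(x) <= 1:
        return None
    return (abs(x) - 1) * sqrt(m) / s


if __name__ == "__main__":
    y, z, delta = [1.0, 2.0, 3.0], [0.0, 1.0, 2.0], 0.5
    x, s, m = reduce_two_samples(y, z, delta)
    print("X =", x, " S =", round(s, 4), " m =", m)
    print("t =", t_statistic(x, s, m), " sqrt(m)/S =", round(sqrt(m) / s, 4))
```

In the original units the quantity $t$ equals $(|\bar Y - \bar Z| - \Delta)$ divided by the usual pooled standard error of $\bar Y - \bar Z$, so it is the familiar two-sample $t$ statistic for the shifted difference.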

In the computations for the chart good use was made of the approximate value $t_{\alpha/2} - \sqrt{m}/S$
for $t_0(\sqrt{m}/S)$. This approximation rests on the fact that as $r \to \infty$, the distance of the point
$(r, \theta)$ in the first quadrant from the line $y = x\sqrt{m}/t_{\alpha/2}$ tends to 0. The recursion formulae
(3.1) were used until the error of this approximation was less than 0.1, and thereafter the error
was interpolated quadratically.

Inspection of Fig. 2 shows that the test has an intuitively reasonable form. When $s/\sqrt{m}$
is very small, it is plausible that $\sigma$ is small, and the two sides do not interfere with each other.
We may then safely use separate one-tailed $t$-tests each of level $\alpha$; this corresponds to the flat
portions of the curves. When $s/\sqrt{m}$ is large, our hypothesis is not essentially different from the
classical Student hypothesis, and the two-tailed $t$-test of size $\alpha$ is appropriate; this is given in
the left margin of the chart. When $m$ is very large, we may safely take the consistent estimate
$s/\sqrt{m}$ as if it were $\sigma$, and use the obvious normal test available when $\sigma$ is known; this is the lowest
curve of the chart. For any given finite value of $m$, the critical value of $t$ changes in a more or
less regular way, as $s/\sqrt{m}$ is increased, from one extreme to the other. The waves, visible for
$m = 2$ and to a lesser extent for $m = 3$, are to be expected in the light of the situation for $m = 1$.
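The limiting case of known $\sigma$ mentioned above is easy to compute and shows both extremes. The sketch below is a hedged illustration, not the authors' computation: for $X$ normal with known $\sigma$ it solves $P(|X| > c \mid \xi = 1) = \alpha$ for $c$ by bisection and reports $(c - 1)/\sigma$, the analogue of the critical value $t_0$. The function names, bracket and tolerance are assumptions made here.

```python
from statistics import NormalDist

_STD = NormalDist()


def rejection_prob(c, xi, sigma):
    """P(|X| > c) for X ~ N(xi, sigma^2)."""
    return _STD.cdf((-c - xi) / sigma) + 1.0 - _STD.cdf((c - xi) / sigma)


def critical_value(sigma, alpha, tol=1e-12):
    """Solve P(|X| > c | xi = 1) = alpha for c by bisection."""
    lo, hi = 0.0, 1.0 + 10.0 * sigma        # rejection_prob is decreasing in c
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        if rejection_prob(mid, 1.0, sigma) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    alpha = 0.05
    for sigma in (0.1, 0.5, 1.0, 5.0, 50.0):
        c = critical_value(sigma, alpha)
        print(sigma, round((c - 1.0) / sigma, 4), round(c / sigma, 4))
```

For small $\sigma$ the standardized cut-off $(c - 1)/\sigma$ approaches the one-tailed normal value for level $\alpha$, while for large $\sigma$ the ratio $c/\sigma$ approaches the two-tailed value, in line with the two extremes described in the text.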

We conclude by remarking that our test also provides confidence intervals in the usual way. 
For instance, it will give an upper confidence limit for the absolute amount by which the mean 
of a normal population differs from a given quantity, or for the absolute difference of the means 
of two normal populations of the same unknown variance. 



4. Testing for Goodness of Fit 

It is in connection with testing for goodness of fit that the existence of the problem under
consideration has been pointed out by Berkson and others. We are concerned with a multinomial
distribution with, say, $r$ classes and with the hypothesis that the point $p = (p_1, \dots, p_r)$,
where $p_i$ is the probability of the $i$th class, lies on a specified surface $\mathcal{S}$. The simplest case is that
in which the hypothesis specifies the $p$'s completely so that $\mathcal{S}$ consists of a single point. Let
$\Delta(p, p')$ be a distance function in the sense that $\Delta(p, p') \ge 0$ for all $p, p'$ and $\Delta(p, p') = 0$ if and
only if $p = p'$. A typical example is ordinary Euclidean distance or more generally
$\Delta(p, p') = \sum w_i (p_i' - p_i)^2$ where the weights $w_i$ may be functions of $p, p'$. This includes in particular
the usual $\chi^2$-measure of distance. Let $d(p)$ denote the smallest distance of a point $p$ from $\mathcal{S}$ so that

$$d(p) = \min_{p' \in \mathcal{S}} \Delta(p, p'). \qquad (4.1)$$

We shall then test instead of $H$ the hypothesis $H^*$ that the true point $p$ lies within a given distance
of $\mathcal{S}$, that is, the hypothesis

$$H^* : d(p) \le c. \qquad (4.2)$$

In order to obtain a test of $H^*$ we shall assume that $\mathcal{S}$ and $\Delta$ are such that the function $d(p)$
possesses continuous first and second partial derivatives. We shall not discuss conditions on
$\mathcal{S}$ and $\Delta$ which would insure this since such conditions do not appear to be particularly simple
and since in applications it will always be necessary to obtain the function $d(p)$, and the regularity
assumption is then easily checked directly. Of course in the particular case that $\mathcal{S}$ consists of a
single point, say $p^0$, we simply have $d(p) = \Delta(p, p^0)$ and the condition on $d$ immediately reduces
to one on $\Delta$.

Under these assumptions we propose the following procedure for testing $H^*$. Let
$q_i$ $(i = 1, \dots, r)$ denote the relative frequency in the $i$th class and let $q = (q_1, \dots, q_r)$. Then
if $d(q) \le c$ accept $H^*$. If $d(q) > c$ test the hypothesis

$$H' : d(p) = c \qquad (4.3)$$

by means of minimum $\chi^2$ (or some asymptotically equivalent test) in the usual manner. For
example, we may reject when the minimum (modified) $\chi^2$

$$n \sum_{i=1}^{r} \frac{(q_i - \hat p_i)^2}{q_i} > K \qquad (4.4)$$

where $n$ is the sample size, where the $\hat p_i$ are BAN estimates (Neyman, 1949) of the $p_i$ subject to
$H'$, and where $K$ is determined from the distribution of $\chi_1^2$, that is, $\chi^2$ with 1 degree of freedom.
(See Neyman (1949), p. 267. The one degree of freedom arises because the only restriction on
$p$ is $d(p) = c$.) However, since we are subjecting the rejection region to the additional restriction
$d(q) > c$, the cut-off point $K$ is determined so that

$$P(\chi_1^2 > K) = 2\alpha \qquad (4.5)$$

where $\alpha$ is the desired level of significance. This differs from the usual procedure in which the
right-hand side of (4.5) is taken to be $\alpha$.

We shall now prove that the power function $\beta_n(p)$ of this test has the following property:

$$\lim_{n \to \infty} \beta_n(p) = \begin{cases} 0 \\ \alpha \\ 1 \end{cases} \quad \text{as} \quad d(p) \begin{cases} < \\ = \\ > \end{cases} c. \qquad (4.6)$$

This result follows easily from the theory of minimum-$\chi^2$ tests. (For details see Cramér (1946)
and Neyman (1949).) In particular, if $d(p) < c$, it follows from the continuity of $d$ and the fact
that $q \to p$ in probability that $d(q) < c$ with probability tending to one, so that $\beta_n(p) \to 0$. Suppose
next that $d(p) > c$. Then it follows similarly that $d(q) > c$ with probability tending to one. Also
it follows from the consistency of the $\chi^2$ test that the probability of (4.4) tends to one. Hence
$\beta_n(p)$ is the probability of the simultaneous occurrence of two events, say $A_n$ and $B_n$, for which
$\lim P(A_n) = \lim P(B_n) = 1$. But this implies that $\lim \beta_n(p) = \lim P(A_n \cap B_n) = 1$.

Let us finally consider the case in which the true probability point, say $p^0$, lies on the surface
$\mathcal{S}' : d(p) = c$. Then it is known that $n \sum \{(q_i - \hat p_i)^2/q_i\}$ has the same limiting distribution as
$[\sum \gamma_i \sqrt{n}\,(q_i - p_i^0)]^2$ where the $\gamma$'s are a set of coefficients such that

(i) The equation of the tangent plane of the surface $\mathcal{S}'$ at $p^0$ is $\sum \gamma_i (p_i - p_i^0) = 0$, and

(ii) $\sum p_i^0 \gamma_i = 0$, $\sum p_i^0 \gamma_i^2 = 1$.

Therefore, $d(q) = d(p^0) + \sum \gamma_i (q_i - p_i^0) + O_p(1/n)$ and hence, since $d(p^0) = c$,

$$d(q) - c > 0 \quad \text{implies} \quad \sqrt{n} \sum \gamma_i (q_i - p_i^0) + O_p(1/\sqrt{n}) > 0.$$

Therefore

$$\beta_n(p^0) = P\Big\{ n \sum \frac{(q_i - \hat p_i)^2}{q_i} > K, \; d(q) > c \Big\}$$

becomes in the limit equal to

$$P\Big\{ \Big[\sum \gamma_i \sqrt{n}\,(q_i - p_i^0)\Big]^2 > K, \; \sum \gamma_i \sqrt{n}\,(q_i - p_i^0) > 0 \Big\}.$$

But if we let $Y = \sum \gamma_i \sqrt{n}\,(q_i - p_i^0)$, then $Y$ has a limiting normal distribution with zero mean
and unit variance, and the result follows.

The test is of course very simple to carry out since in the minimization of $\chi^2$ we have only the
single condition $d(p) = c$. If, following Neyman, we replace this by the asymptotically equivalent
condition $d(q) + \sum a_i(q)(p_i - q_i) = c$ where $a_i = a_i(q) = \partial d(q)/\partial q_i$, we get the solution

$$q_i - \hat p_i = (a_i - \bar a)\, q_i\, [d(q) - c] / \sigma_a^2 \qquad (4.7)$$

where $\bar a = \sum a_i q_i$, $\sigma_a^2 = \sum q_i (a_i - \bar a)^2$.

As an example, suppose that we wish to test that all $r$ cell probabilities are equal, and that we
set $\Delta(p, p') = \sum (p_i' - p_i)^2$. Then the modified hypothesis becomes $H^* : \sum_{i=1}^{r} (p_i - 1/r)^2 \le c$,
and we must determine the $\hat p_i$ subject to the condition $\sum (\hat p_i - 1/r)^2 = c$. We then have
$a_i = 2(q_i - 1/r)$ and can compute $q_i - \hat p_i$ from (4.7).
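The equal-cells example can be carried through end to end in a few lines. The following Python sketch is an illustration under stated assumptions rather than a definitive implementation: it uses the linearized solution (4.7), the modified $\chi^2$ of (4.4) with the observed frequencies $q_i$ in the denominators, and the cut-off from (4.5), obtained as the square of the upper $\alpha$ point of the standard normal. The function name, the counts and the value of $c$ are hypothetical.

```python
from statistics import NormalDist


def approx_validity_test(counts, c, alpha):
    """Test H*: sum_i (p_i - 1/r)^2 <= c from observed cell counts.

    Returns (reject, statistic); statistic is None when d(q) <= c.
    """
    n, r = sum(counts), len(counts)
    q = [k / n for k in counts]
    if min(q) == 0.0:
        raise ValueError("modified chi-square needs every observed frequency > 0")
    d_q = sum((qi - 1.0 / r)**2 for qi in q)
    if d_q <= c:
        return False, None                                # accept H* outright
    a = [2.0 * (qi - 1.0 / r) for qi in q]                # a_i = partial d(q) / partial q_i
    a_bar = sum(ai * qi for ai, qi in zip(a, q))
    var_a = sum(qi * (ai - a_bar)**2 for ai, qi in zip(a, q))
    diff = [(ai - a_bar) * qi * (d_q - c) / var_a for ai, qi in zip(a, q)]  # (4.7)
    stat = n * sum(di**2 / qi for di, qi in zip(diff, q))                   # (4.4)
    k_cut = NormalDist().inv_cdf(1.0 - alpha)**2          # P(chi^2_1 > K) = 2 * alpha
    return stat > k_cut, stat


if __name__ == "__main__":
    print(approx_validity_test([300, 250, 250, 200], c=0.001, alpha=0.05))
    print(approx_validity_test([260, 250, 250, 240], c=0.001, alpha=0.05))
```

Substituting (4.7) into (4.4) collapses the statistic to $n[d(q) - c]^2/\sigma_a^2$, which the sketch computes in two steps so that the fitted differences $q_i - \hat p_i$ remain visible.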